-
Notifications
You must be signed in to change notification settings - Fork 916
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Update sort groupby to use non-atomic operation #9035
Update sort groupby to use non-atomic operation #9035
Conversation
Codecov Report
@@ Coverage Diff @@
## branch-21.10 #9035 +/- ##
===============================================
Coverage ? 10.83%
===============================================
Files ? 114
Lines ? 19098
Branches ? 0
===============================================
Hits ? 2070
Misses ? 17028
Partials ? 0 Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Adding test cases for dictionary column would be beneficial.
Rest looks good.
rerun tests |
* @tparam Iterator iterator type that acts as index of the output. | ||
*/ | ||
template <typename BinaryFunction, typename Iterator> | ||
class output_writer_iterator |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks like a lot of machinery and I'm not clear about what it's purpose is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I created this first to test, if I can set element
and set_valid
at the same time using single reduce_by_key
with thrust::optional
. But using thrust::optional
was always slow.
So, I reverted to use 2 reduce_by_key
, 1 for element, another for null_mask.
To use reduce_by_key
with null_mask, it needs a temporary bool buffer and valid_if
.
To avoid this, I used this transform_output_writer_iterator
.
Anyway, this is purely to avoid allocation of temporary bool buffer, but it doesn't affect performance much.
I will revert to using temporary bool buffer, and remove this iterator. (I thought that there may be other use cases for this iterator in cudf. Especially with null_mask.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This machinery is the bare minimum to use a proxy object for assignments in thrust. The question is if the proxy object is useful. If it is, then the machinery is just overhead.
A better name, imho, would be assignment_iterator
From what I see the use case is overriding the assignment operation. In an output iterator, the assignment operator is used to take the value on the rhs and assign it to the lhs. here, we intercept that assignment operation and call a binary operator binop(lhs, rhs)
that can override the assignment operation.
This PR uses the assignment_iterator
with a lambda that captures a null mask. The proxy intercepts the lhs int
and a rhs bool
and then invokes the lambda which calls set_valid
or set_null
on the captured mask. It is the output version of make_validity_iterator
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should be more specific here. I would like to see a make_validity_output_iterator
that is very focused on dealing with null masks of a particular column. With an interface like this, we can start to experiment with opportunistic coalescing of null mask assignments using cooperative groups. Also, this API makes it very obvious that we are doing individual bit assignments and that coalescing them by changing the calling code could prove more performant.
auto make_validity_output_iterator(mutable_column_device_view const& destination);
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The implementation may well use the proxy-based iterator you've created, but there would be a very clear use case for it, and other developers will have an easy to use API for that use case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@elstehle IIRC you've used thrust a lot... have you ever run in to an instance where this sort of assignment interception would be useful?
Sorry, I don't have a concrete use case in the back of my head right now
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jrhemstad I'm still confused. Why exactly is cooperative groups not an option here? I was thinking we could try coalesced_threads()
, followed by labeled_partition()
and reduce()
, then use thread_rank() == 0
to indicate which threads should write the output, and use atomics to make sure uncoordinated writes don't interfere with one another. I doubt this will be fast in the majority of use cases, but I am not sure why it would not work. It would be an experiment to see if we can speed up the best-case scenario, where all/most writes are sequential.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
. I was under the impression cooperative groups was able to detect which threads were active
They can. Determining which threads are active is not the problem.
The problem is coordinating and detecting which threads are attempting to update the same bits in a given 4B or 8B word. For example, if any two threads in a grid want to update bits i
and i+1
, how do you detect that scenario without some form of communication? coalesced_threads
doesn't help you here because whether or not the threads are coalesced doesn't tell you anything about what bits they are updating.
Furthermore, even if the two threads are both active at the same time in the same warp (which you could never rely on) how do you detect that t0
wants to update bit i
and t1
update bit i+1
when it could just as well be t0
updates i
and t1
updates i + 1042
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's where @karthikeyann 's proxy class comes in to play. It knows the index and the value, and the lambda is responsible for using that index and value to assign the appropriate bit to a captured null mask. It appears we have enough information to attempt opportunistic concurrency here. I'm not saying it will be beneficial, only that it seems possible and might be worth an experiment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The problem is coordinating and detecting which threads are attempting to update the same bits in a given 4B or 8B word.
the proxy class has the index, and we can pass this to labeled_partition
.
For example, if any two threads in a grid want to update bits i and i+1, how do you detect that scenario without some form of communication?
For the sake of the experiment, the communication would be limited to 32 threads within a warp. Communicating across warps would require shmem, and since we're in a lambda called by thrust, I'm not sure if/how that would work.
coalesced_threads doesn't help you here because whether or not the threads are coalesced doesn't tell you anything about what bits they are updating.
coalesced_threads
is just to prevent UB/hang, labeled_partition
would be responsible for determining what writes can be coalesced by using idx/32
as the label (with some offset, if necessary).
Did you run any benchmarks before/after this? |
Benchmark Comparison: (Time in ms)
|
struct null_as_sentinel { | ||
column_device_view const col; | ||
size_type const SENTINEL; | ||
__device__ size_type operator()(size_type i) const { return col.is_null(i) ? SENTINEL : i; } | ||
}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can't null_replacement_iterator
be used instead?
cudf/cpp/include/cudf/detail/iterator.cuh
Line 162 in dfe0a03
auto make_null_replacement_iterator(column_device_view const& column, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can't. null_replacement_iterator
returns values of the column. Here, indices are needed.
* @tparam T Type of the underlying column. For dictionary column, type of the key column. | ||
*/ | ||
template <typename T> | ||
struct null_replaced_value_accessor : value_accessor<T> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same question here. Does the existing null_replacement_iterator
not work?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
null_replacement_iterator
right now doesn't support dictionary columns. This functor does. If null_replacement_iterator
is updated to support dictionary too, it will add to all kernels using it.
Can I add dictionary support to null_replacement_iterator<T>
(T is underlying type, not dictionary32 for dictionary type)?
(could be another PR, column_device_view::begin<T>()
could be updated too. It would provide wide support for dictionary columns in most algorithms.
This needs all benchmarks comparison too).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@davidwendt thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you use this?
cudf/cpp/include/cudf/dictionary/detail/iterator.cuh
Lines 110 to 112 in 8b02ca3
template <typename KeyType> | |
auto make_dictionary_pair_iterator(column_device_view const& dictionary_column, | |
bool has_nulls = true) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be better to use the indices for any cudf operations where possible for both run-time and compile-time performance. For example, sorting in general only needs the indices.
You can use this function
column_view get_indices_annotated() const noexcept; |
to get the indices column_view decorated with the offset, size, and validity-mask appropriately.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hash groupby produces base type column as output.
If we use gather with ARGMIN, or ARGMAX for MIN, or MAX, it would create dictionary column. (added one more test for this, and updated sort groupby to fix this)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This sounds correct to me. Aggregates like min/max return values that already exist in the column so the output would have the same keys as the input. Whereas, sum/prod create totally new values.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, here is an example using the the dictionary-pair-iterator along with a null-replacement transformer.
cudf/cpp/src/reductions/simple.cuh
Lines 142 to 146 in f0fa255
auto f = simple_op.template get_null_replacing_element_transformer<ResultType>(); | |
auto p = | |
cudf::dictionary::detail::make_dictionary_pair_iterator<ElementType>(*dcol, col.has_nulls()); | |
auto it = thrust::make_transform_iterator(p, f); | |
return detail::reduce(it, col.size(), simple_op, stream, mr); |
I'm inclined to prefer your approach here instead since it simplifies the caller to one value-accessor. The only thing that makes me nervous is that
col.element<dictionary32>(i)
would be included/inlined for every type and that function contains it's own type-dispatcher call in it. But technically every type is potentially a dictionary key type so I think the same amount of code is generated either way. Anyway, it may be worth looking into using this null-replacement accessor in the reductions code too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
dictionary32
means 32 bit int index right?
why is there another type dispatcher for col.element<dictionary32>(i)
if index type is already known?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Dictionary index types can technically be any unsigned integer type. The element<dictionary32>(i)
always returns an int32 value regardless of the underlying indices type.
https://github.com/rapidsai/cudf/blob/branch-21.10/cpp/include/cudf/column/column_device_view.cuh#L415-L421
rerun tests |
1 similar comment
rerun tests |
In trying to confirm that this enables
will wait for fix. |
@codereport The issue is fixed. |
I have half the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Enhanced sort-based group_by
for decimal128
work with this PR :)
@gpucibot merge |
This PR replaces
update_target_element
withreduce_by_key
in sort groupby reduce_functor. (to allow decimal128 sort groupby)Operations updated are
Compilation time increased from 1m18s to
3m28s1m27s.With major compilation time taking 184s for group_argmin.cu, group_argmax.cu each. (now trying to reduce this time)reduced compile time of group_argmin.cu, group_argmax.cu to 70s each.